Exploratory Data Analysis¶

Owner: Daniel Soukup - Created: 2025.11.01

The goal of this notebook is to explore the dataset, understand our target column and features (statistical properties, data quality) and plan for preprocessing and modelling.

Data Loading¶

Let's ensure the datasets are available:

Load the datasets:
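A minimal sketch of the load, assuming the raw CSV files ship without a header row (a toy in-memory CSV stands in for the real file paths here, which are not shown in this notebook):

```python
import io
import pandas as pd

# Toy stand-in for the real train/test files; in practice this would be
# pd.read_csv("<path-to-train-file>", header=None, skipinitialspace=True).
raw = io.StringIO("39, Private, 0\n50, Self-emp, 1\n")

# header=None assigns positional integer names, since column names are missing.
df = pd.read_csv(raw, header=None, skipinitialspace=True)
print(df.shape)  # (2, 3)
```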

We have ~200K rows for training and ~100K rows for testing, with 42 columns, the last being our target. Note that the column names are missing.

Observations:

It is clear that we need to do a fair bit of cleaning:

Excerpt from the data dict:

A quick check confirms that there are no exactly duplicated columns:
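One way to run such a check is to transpose the frame so `duplicated()` compares columns instead of rows; a sketch on a toy frame:

```python
import pandas as pd

# Toy frame: column "b" duplicates "a" exactly, "c" does not.
df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [4, 5, 6]})

# Transposing makes each column a row, so duplicated() flags identical columns.
dup_cols = df.columns[df.T.duplicated()].tolist()
print(dup_cols)  # ['b']
```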

Column Mapping¶

Let's look at the high-level stats first:

We'll use the data dictionary and number of unique values to map our columns to their proper names. We also need this information to make well-informed decisions on how the columns should be processed for modelling.
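The unique-value counts that feed this mapping can be computed per column; a sketch on a toy frame (the real frame has 42 unnamed columns):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 38, 25, 61], "sex": ["F", "M", "F", "F"]})

# Low cardinality suggests a categorical code; high cardinality a continuous field.
print(df.nunique().sort_values())
```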

Based on this we put together our mapping:

About the instance weight:

The instance weight indicates the number of people in the population that each record represents due to stratified sampling. To do real analysis and derive conclusions, this field must be used. This attribute should not be used in the classifiers, so it is set to "ignore" in this file.

More info from census.gov:

The base weight, which is the inverse of the probability of the person being in the sample, is a rough measure of the number of actual persons that the sample person represents. Almost all sample persons in the same state have the same base weight, but the weights across states are different. Selection probabilities may also differ for some sample areas due to field subsampling, which is done when areas selected for the sample contain many more households than expected. The base weights are then adjusted for noninterview, and the ratio estimation procedure is applied.

We won't be using this field for modeling as per the data dictionary recommendation.

We are ready to map the names:
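The rename itself is a single `rename` call from position to name; a hypothetical partial mapping for illustration (the actual notebook maps all 42 positions from the data dictionary):

```python
import pandas as pd

df = pd.DataFrame([[39, "Private"]], columns=[0, 1])

# Hypothetical two-entry mapping; the real one covers every column.
name_map = {0: "age", 1: "class_of_worker"}
df = df.rename(columns=name_map)
print(list(df.columns))  # ['age', 'class_of_worker']
```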

Univariate distributions¶

We look at the univariate distribution of each column, which will help us decide on any preprocessing.

Numerical Fields¶

First, look at the numeric fields:

At a quick glance, we can recognize that some of these columns are actually codes, hence better treated as categories:

The numbers encode independent values with no numeric relationship: veterans benefits, own business, and the two recodes.

One might consider adding "num persons worked for employer" and "year" as well, given the concentration to only a few unique values.

Let's convert these and process with the rest of the categorical columns:
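A sketch of the conversion on a toy frame (the column names here are assumptions based on the data dictionary):

```python
import pandas as pd

df = pd.DataFrame({"veterans_benefits": [0, 2, 2, 1], "own_business": [0, 0, 2, 0]})

# These integer codes carry no ordinal meaning, so cast them to category dtype.
code_cols = ["veterans_benefits", "own_business"]
df[code_cols] = df[code_cols].astype("category")
print(df.dtypes)
```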

Look at high level stats:

We see some major skew in these fields and some curious values, such as the 9999 and 99999 for the dollar values.

Observations:

It is likely that 9999 was either used as a filler value in these columns or as an artificial maximum to protect the privacy of some outlier data subjects.

These are extreme outlier values for these fields (based on the percentiles above). We also saw that a really high % of values fall exactly at 0:
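The zero share is a one-liner per column; a toy sketch:

```python
import pandas as pd

# Toy stand-in for one of the dollar-value columns.
df = pd.DataFrame({"capital_gains": [0, 0, 0, 15000]})

# Mean of a boolean mask gives the fraction of exact zeros.
zero_share = (df["capital_gains"] == 0).mean()
print(f"{zero_share:.0%}")  # 75%
```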

This allows us to see the distribution better.

Proposal

In the future, we can apply log-transformation, scaling to these fields if the models require. We can alternatively consider binning to create categorical fields from these variables too (based on uniform or percentile bins).
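A sketch of both options on a toy series (the bin edges are illustrative, not ones we would actually tune):

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 0, 100, 9999, 99999], name="capital_gains")

# Option 1: log1p compresses the skew while keeping the zero-heavy mass at 0.
logged = np.log1p(s)

# Option 2: explicit bins turn the field into a categorical; pd.qcut would
# give percentile-based bins instead.
binned = pd.cut(s, bins=[-1, 0, 1000, 100000], labels=["zero", "low", "high"])
print(binned.tolist())  # ['zero', 'zero', 'low', 'high', 'high']
```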

Categorical fields¶

Let's now consider the categorical fields.

We have a highly imbalanced distribution, with only 6% of train samples in the +50K category. This will likely need addressing: the classification models will be affected, and in general we expect worse performance on this minority class.
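The class balance can be read off a normalized `value_counts`; a toy sketch with roughly the 94/6 split described above:

```python
import pandas as pd

# Toy target standing in for the real income column.
y = pd.Series(["-50K"] * 94 + ["+50K"] * 6)

# normalize=True returns fractions rather than raw counts.
print(y.value_counts(normalize=True))
```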

Let's see the rest of the columns:

We see most columns are again imbalanced with a high number of unique values and long tails.

If all columns are dummy encoded, we get more than 450 columns (after dropping one level of each category):
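A toy sketch of the count with `pd.get_dummies` (on the real frame this is where the >450 figure comes from):

```python
import pandas as pd

df = pd.DataFrame({"education": ["HS", "BSc", "HS"], "sex": ["F", "M", "F"]})

# drop_first drops one level per category, so k levels yield k-1 dummy columns.
dummies = pd.get_dummies(df, drop_first=True)
print(dummies.shape[1])  # 2
```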

In terms of encoding and processing:

Proposal:

Our preprocessing pipeline will implement these steps.

Missing Values¶

We see there is only a single column with missing values, for <0.5% of the rows:

From the data glossary provided:

Hispanic Origin Persons of Hispanic origin in this file are determined on the basis of a question asking if the person is Spanish, Hispanic, or Latino. If the response is "yes," a follow-up question determines a specific ethnic origin, asking to select their (the person's) origin from a "flash card" listing. The flash-card selections are Mexican, Mexican-American, Chicano, Puerto Rican, Cuban, Cuban American, or some other Spanish, Hispanic, or Latino group.

If these values are missing at random, dropping them likely makes little difference to the modelling.

The rows do not seem to be skewed with respect to our target or the 'race' column. However, in general we try to avoid dropping rows at all costs to avoid biasing the data inadvertently.

Proposal: Since the hispanic origin column already has a Do not know value, we can fill with that.
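A sketch of the fill on a toy series (the column name and level text are taken from the glossary discussion above):

```python
import pandas as pd

s = pd.Series(["Mexican", None, "Do not know", "Cuban"], name="hispanic_origin")

# Reuse the existing "Do not know" level rather than dropping rows.
s = s.fillna("Do not know")
print(s.isna().sum())  # 0
```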

Duplicate Rows¶

Let's see how many duplicate rows we have:

This is already significant, but if we look at it without our instance weight column:

We can see that without the weight and income columns, we get the exact number of duplicate instances mentioned in the data dictionary:

Number of instances data = 199523, Duplicate or conflicting instances: 46716
Number of instances in test = 99762, Duplicate or conflicting instances: 20936

It is unclear where these are coming from and there is no ID to identify data subjects (as per the anonymization). Note that this is a significant number of rows to drop from our data and it could be that some of these duplicates are legitimate (two people with the same recorded attributes).

Note: the Census does put a fair effort into deduplication. https://www.census.gov/newsroom/blogs/random-samplings/2021/04/how_we_unduplicated.html

There are several reasons for duplicates in a census: We receive more than one response for an address. People are counted in more than one place because of potentially complex living situations. There might be an issue with the address — a housing unit is on our address list more than once or census materials are misdelivered. We use a special algorithm to resolve the first situation and a series of steps to resolve the second and third.

Proposal: we will drop these rows since there is no additional evidence that these are not corrupted (see data dict).
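The count-then-drop can be sketched on a toy frame, ignoring the instance weight column when comparing rows (column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 25, 40],
    "instance_weight": [1500.0, 1600.0, 900.0],
})

# Compare rows without the weight column, then drop the flagged duplicates.
dup_mask = df.drop(columns=["instance_weight"]).duplicated()
print(dup_mask.sum())  # 1
deduped = df[~dup_mask]
print(len(deduped))  # 2
```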

For our tree-learning algorithm, keeping duplicate rows would amount to weighting the data points, which is really the job of the instance weight column that we already decided not to use for modelling. In a real business situation, we'd explore the reason for the duplication and make the best decision based on a more in-depth understanding.

Relationship to Target¶

We'd like to get a sense which columns are most strongly related to the target. We can do this by standard statistics and tests (chi2 or f-score). We leave the feature selection to the preprocessing notebook and explore select columns here that are promising predictors.

There are only weak relationships between the numeric features.

Between our target and the numeric features:

After a few transformations, we can get a better visual:

Observations:

We could use a two-sample t-test for checking statistical significance of the difference in distributions (although need to be careful with the large sample size).
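A sketch of such a test on toy data, assuming scipy is available (Welch's variant, which does not assume equal variances):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy stand-ins for a numeric feature split by the two income classes.
low = rng.normal(35, 10, 500)
high = rng.normal(45, 10, 500)

# equal_var=False selects Welch's t-test.
t, p = stats.ttest_ind(low, high, equal_var=False)
print(p < 0.001)  # True
```

With ~200K rows even tiny differences come out "significant", which is the caveat about large sample sizes noted above.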

Finally, looking at our categorical fields, we can check pivot tables and chi2 scores, selecting the most significant results by a 0.001 p-value threshold to account for multiple testing:
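A toy sketch of the chi2 check for one categorical column, assuming scipy is available (the toy counts are invented for illustration):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented toy data: self-employed respondents skew toward +50K here.
df = pd.DataFrame({
    "class_of_worker": ["Private"] * 50 + ["Self-emp"] * 50,
    "income": ["-50K"] * 45 + ["+50K"] * 5 + ["-50K"] * 25 + ["+50K"] * 25,
})

# Pivot to a contingency table, then test independence.
table = pd.crosstab(df["class_of_worker"], df["income"])
chi2, p, dof, _ = chi2_contingency(table)
print(p < 0.001)  # True
```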

We can see that these are reasonable columns: industry, class of worker, employment status, etc. are highly likely to affect income. We expect these to show up in our modelling as well and to help guide feature-selection prioritization.